State of the Field

Data Science for Public Policy

Aaron R. Williams and Alex Engler

What is a Data Scientist?

Source: Urban Institute

DS in the Context of Public Policy Analysis

Expanded Perspectives on Data and Methods

  • Web Scraping
  • Imputation
  • Text Analysis and NLP
  • Computer Vision
  • Data Privacy
  • Cluster Analysis
  • Network Analysis
  • Predictive Modeling
  • Causal Inference +

Modern Tools for Learning about Data

  • Open-Source Programming Languages
  • Reproducible Research
  • Cloud Computing
  • APIs, Databases, and Infrastructure
  • Data Visualization and Interactivity
  • Algorithms from Statistics, Computer Science, and Cryptography

Expanded Perspectives on Data and Methods

  1. Getting Data
  2. Finding Structure in Data
  3. Making Predictions with Data
  4. Causal Inference +

Source: Valerie Tygart

1. Getting Data

“Don’t Sleep on Summary Statistics”

Source: Washington Post

Web Scraping

The Billion Prices Project

Hand Coding + Machine Learning

How Urban Piloted Data Science Techniques to Collect Land-Use Reform Data

NLP + Summarizing Massive Data

Public Perceptions of Police on Social Media

The millions of tweets shared on Twitter daily are a rich resource of public sentiment on countless topics. In the wake of highly publicized officer-involved shootings, many people take to social media to express their opinions, both positive and negative, of the police. We collected millions of public tweets and employed machine learning to explore whether we can measure public sentiment toward the police. Specifically, we examine how public sentiment changed over time and in response to one high-profile event, the 2015 death of Freddie Gray in Baltimore. While accounting for the larger trends in the public image of the police on Twitter, we find that sentiment became significantly more negative after Gray’s death and during the subsequent protests.

Computer Vision

Data Privacy

Source: Urban Institute

Data Privacy

Source: Nature

2. Finding Structure in Data

Cluster Analysis

Beyond Red vs. Blue: The Political Typology

This report uses cluster analysis to sort people into cohesive groups, based on their responses to 23 questions covering an array of political attitudes and values. First developed in 1987, the Pew Research Center’s Political Typology has provided a portrait of the electorate at various points across five presidencies; the last typology study was released in May 2011.
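To make the idea of cluster analysis concrete, here is a minimal k-means sketch on a handful of made-up survey respondents. It is not the Pew Research Center's procedure, since the text above does not say which algorithm the typology uses. The two-question responses, the `kmeans` helper, and the farthest-first starting rule are all assumptions chosen for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain k-means. Returns (centroids, labels), one label per point."""
    # Deterministic start (an assumption of this sketch): take the first
    # point, then repeatedly add the point farthest from the chosen centroids.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )

    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each respondent joins the nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


# Hypothetical respondents: answers to two attitude questions on a 1-10 scale.
responses = [(1, 2), (2, 1), (1, 1), (8, 9), (9, 8), (9, 9)]
centroids, labels = kmeans(responses, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

With real survey data there would be many more questions, and the number of groups would itself be a judgment call rather than a fixed input.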

Network Analysis

Source: Network Propaganda Explored

3. Making Predictions with Data

Targeting Interventions

Lead poisoning is a major public health problem that affects hundreds of thousands of children in the United States every year. A common approach to identifying lead hazards is to test all children for elevated blood lead levels and then investigate and remediate the homes of children with elevated tests. This can prevent future residents from being exposed to lead, but only after a child has been poisoned. This paper describes joint work with the Chicago Department of Public Health (CDPH) in which we build a model that predicts the risk of a child being poisoned so that an intervention can take place before that happens. Using two decades of blood lead level tests, home lead inspections, property value assessments, and census data, our model allows inspectors to prioritize houses on an intractably long list of potential hazards and identify children who are at the highest risk. This work has been described by CDPH as pioneering in the use of machine learning and predictive analytics in public health and has the potential to have a significant impact on both health and economic outcomes for communities across the US.

Predictive Modeling for Public Health: Preventing Childhood Lead Poisoning
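The prioritization idea in the abstract above can be sketched as "fit a risk model, then sort the inspection list by predicted risk." The example below uses a small logistic regression fit by gradient descent. It is an illustration only, not the model from the CDPH work: the two features (home age in centuries, a prior-hazard flag), the training rows, and the home names are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit logistic regression by batch gradient descent. Returns (weights, bias)."""
    n = len(rows)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            # Prediction error for this row drives the gradient.
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_risk(model, x):
    """Predicted probability of the outcome for one feature tuple."""
    weights, bias = model
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Hypothetical training data: (home age in centuries, prior hazard flag)
# and whether a child in the home later had an elevated blood lead test.
train_x = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 0)]
train_y = [0, 0, 0, 1, 1, 1]
model = fit_logistic(train_x, train_y)

# Hypothetical homes awaiting inspection, ranked from highest to lowest risk.
homes = {"home_a": (0.2, 0), "home_b": (0.9, 1), "home_c": (0.6, 0)}
ranked = sorted(homes, key=lambda h: predict_risk(model, homes[h]), reverse=True)
print(ranked)  # ['home_b', 'home_c', 'home_a']
```

The design point is the last step: the model's output is used to order a list that is too long to inspect in full, not to make a yes-or-no decision about any single home.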

Data Imputation

Ethics and Empathy in Using Imputation to Disaggregate Data for Racial Equity, A Case Study Imputing Credit Bureau Data

Disaggregating data by race and ethnicity is a critical method for shining light on racialized systems of privilege and oppression. Imputation is a powerful tool for disaggregating data: it adds estimated racial and ethnic identifiers to datasets that lack this information. But if used without a proactive focus on equity, it can harm Black people, Indigenous people, and other people of color.

4. Causal Inference +

Using Prediction to Add Controls

Econometrics +

  • Estimating the heterogeneity of treatment effects
  • Using modified regression trees as a standard robustness check for regression analysis
  • Understanding uncertainty when a researcher observes the entire population

Source: “Machine Learning and Causal Inference for Policy Evaluation” by Susan Athey
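As a point of reference for the first bullet above, the simplest way to look for heterogeneous treatment effects is to compare treated and untreated means within each subgroup. The sketch below does only that. It is a baseline, not the machine learning approach from the cited source, and it is meaningful only if treatment was randomly assigned. The records, group names, and the `subgroup_effects` helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def subgroup_effects(records):
    """Difference in mean outcome (treated minus control) within each group.

    records: iterable of (group, treated, outcome) tuples.
    Assumes every group has at least one treated and one control unit.
    """
    arms = defaultdict(lambda: {True: [], False: []})
    for group, treated, outcome in records:
        arms[group][bool(treated)].append(outcome)
    return {g: mean(a[True]) - mean(a[False]) for g, a in arms.items()}


# Hypothetical experiment: (group, received the program?, outcome).
records = [
    ("urban", True, 12.0), ("urban", True, 14.0),
    ("urban", False, 10.0), ("urban", False, 10.0),
    ("rural", True, 10.0), ("rural", True, 11.0),
    ("rural", False, 10.0), ("rural", False, 11.0),
]
effects = subgroup_effects(records)
print(effects)  # {'urban': 3.0, 'rural': 0.0}
```

Splitting by hand like this requires choosing the subgroups in advance; data-driven methods aim to find the splits without that choice inviting a search for spurious effects.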

Modern Tools for Learning About Data

Tools

  • Proprietary Programming Languages
  • PDFs and Tables
  • Public Data
  • Causal Inference/Econometrics
  • Local Computing
  • Experimental Design
  • Microsimulation

Tools

  • Proprietary + Open-Source Programming Languages
  • PDFs and Tables + Data Visualization and Interactivity
  • Public Data + Private Data
  • Causal Inference/Econometrics + Algorithms from Statistics, Computer Science, and Cryptography
  • Local Computing + Cloud Computing
  • Experimental Design
  • Microsimulation
  • Reproducible Research and Workflows
  • APIs, Databases, and Infrastructure